Multi-Modal Tracking

ASTR: Efficient Multi-Modal Tracking with Asymmetric Transformer

Overview

ASTR is an asymmetric RGB-X tracking architecture consisting of a multi-layer encoder and a single-layer decoder. It achieves strong tracking performance with fewer parameters and FLOPs than conventional dual-stream trackers, which makes it both efficient and accurate for RGB-Thermal, RGB-Event, and RGB-Depth tracking.

Fig 1. The proposed ASTR architecture with asymmetric encoder-decoder design

Key Contributions:

Asymmetric Transformer: Combines a heavy multi-layer encoder for powerful feature embedding with a lightweight single-layer decoder for low computational cost.
N-token Modal Blending: Learns multiple mixing ratios of RGB and X-modality (e.g., thermal, depth, event) tokens to generate diverse template representations.
Online Template Update: Uses the classification confidence score to decide when to update templates, improving robustness to changes in target appearance.
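
The online template update can be illustrated with a minimal sketch. The text above only states that the classification confidence score drives the update. The fixed-threshold rule, the threshold value of 0.7, and the names TemplateBank and maybe_update are assumptions made for illustration and are not taken from ASTR itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TemplateBank:
    """Keeps the fixed initial template and one replaceable online template."""
    initial: List[float]
    online: List[float]
    threshold: float = 0.7  # assumed value; the source does not give one
    updates: int = 0

    def maybe_update(self, candidate: List[float], confidence: float) -> bool:
        """Replace the online template only when the tracker is confident.

        Hypothetical rule: accept the candidate if the classification
        confidence reaches the threshold, otherwise keep the old template.
        """
        if confidence >= self.threshold:
            self.online = list(candidate)
            self.updates += 1
            return True
        return False


bank = TemplateBank(initial=[1.0, 0.0], online=[1.0, 0.0])
bank.maybe_update([0.2, 0.8], confidence=0.4)  # low confidence: template kept
bank.maybe_update([0.5, 0.5], confidence=0.9)  # high confidence: template replaced
```

Keeping the initial template untouched is a common safeguard against drift in trackers of this kind; whether ASTR does the same is not stated here.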

Decoder and Modal Mixing:

The decoder applies cross-attention between the search embeddings and the modal-blended template embeddings. Modal blending applies learnable softmax-based weights to the RGB and X-modality features, producing N diverse template representations.
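
The blending and decoding steps described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the ASTR implementation: each modality's template is reduced to a single feature vector, every one of the N blends has its own pair of logits, and the cross-attention is single-head with no learned projections. The function names blend_templates and cross_attention are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(values: Vector) -> Vector:
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def blend_templates(rgb: Vector, x: Vector, logits: Matrix) -> Matrix:
    """Build N blended templates from one RGB and one X-modality vector.

    Each row of `logits` is an assumed learnable pair [rgb_logit, x_logit];
    its softmax gives the mixing ratio for that blended template.
    """
    blended = []
    for pair in logits:
        w_rgb, w_x = softmax(pair)
        blended.append([w_rgb * r + w_x * m for r, m in zip(rgb, x)])
    return blended


def cross_attention(search: Matrix, templates: Matrix) -> Matrix:
    """Single-head cross-attention without learned projections (a simplification).

    Search tokens act as queries; blended templates act as keys and values.
    """
    dim = len(templates[0])
    output = []
    for query in search:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in templates
        ]
        attn = softmax(scores)
        output.append(
            [sum(w * t[i] for w, t in zip(attn, templates)) for i in range(dim)]
        )
    return output


rgb_template = [1.0, 0.0]
x_template = [0.0, 1.0]
mix_logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]  # N = 3 mixing ratios
templates = blend_templates(rgb_template, x_template, mix_logits)
decoded = cross_attention([[1.0, 0.0], [0.0, 1.0]], templates)
```

In this toy setup the first logit pair yields an even mix, while the second and third favor the RGB and X-modality features respectively, so each search token can attend to whichever blend matches it best.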

Performance & Efficiency:

ASTR outperforms state-of-the-art trackers on the LasHeR, VisEvent, RGBT234, and DepthTrack datasets, using up to 55.6% fewer FLOPs while maintaining competitive tracking accuracy.

Fig 2. ASTR vs. other RGB-X trackers on multiple datasets